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This  paper  presents  a  new  randomized  search  method  called  Evolutionary  Random  Policy 
Search  (ERPS)  for  solving  infinite  horizon  discounted  cost  Markov  Decision  Process  (MDP) 
problems.  The  algorithm  is  particularly  targeted  at  problems  with  large  or  uncountable 
action  spaces.  ERPS  approaches  a  given  MDP  by  iteratively  dividing  it  into  a  sequence  of 
smaller,  random,  sub-MDP  problems  based  on  information  obtained  from  random  sampling 
of  the  entire  action  space  and  local  search.  Each  sub-MDP  is  then  solved  approximately 
by  using  a  variant  of  the  standard  policy  improvement  technique,  where  an  elite  policy  is 
obtained.  We  show  that  the  sequence  of  elite  policies  converges  to  an  optimal  policy  with 
probability  one.  An  adaptive  version  of  the  algorithm  that  improves  the  efficiency  of  the 
search  process  while  maintaining  the  convergence  properties  of  ERPS  is  also  proposed.  Some 
numerical  studies  are  carried  out  to  illustrate  the  algorithm  and  compare  it  with  existing 
procedures. 

Keywords:  Markov  decision  process  (MDP),  policy  iteration  (PI),  evolutionary  policy  it¬ 
eration  (EPI),  genetic  algorithms  (GAs),  global  optimization. 
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1  Introduction 


From  operations  research  to  artificial  intelligence,  a  broad  range  of  problems  can  be  formu¬ 
lated  and  described  by  Markov  decision  process  (MDP)  models.  Up  to  now,  the  majority  of 
the  solution  methods  have  concentrated  on  reducing  the  size  of  the  state  space  in  order  to 
address  the  well-known  “curse  of  dimensionality”  (i.e.,  the  typical  exponential  growth  in  the 
state  space  size  with  the  parameters  of  the  problem).  Some  well-established  approaches  are 
state  aggregation  (Bertsekas  and  Castanon  1989),  feature  extraction  (Tsitsiklis  and  Van  Roy 
1996),  random  successive  approximations  and  random  multigrid  algorithms  (Rust  1997),  and 
various  value  function  approximation  schemes  via  the  use  of  basis  functions  (de  Farias  and 
Van  Roy  2003,  Trick  and  Zin  1997),  just  to  name  a  few.  The  key  idea  throughout  is  to 
avoid  enumerating  the  entire  state  space.  However,  most  of  the  above  approaches  generally 
require  the  ability  to  search  the  entire  action  space  in  order  to  choose  the  best  action  at  each 
step  of  the  iteration  procedure;  thus  problems  with  very  large  action  spaces  may  still  pose  a 
computational  challenge. 

The  approach  proposed  in  this  paper  is  meant  to  complement  these  highly  successful 
techniques.  In  particular,  we  will  focus  on  MDPs  where  the  state  space  is  relatively  small  but 
the  action  space  is  very  large,  so  that  enumerating  the  entire  action  space  becomes  practically 
inefficient.  These  types  of  problems  often  arise  in  control  of  queues.  For  example,  consider 
the  problem  of  controlling  the  service  rate  of  a  single-server  queue  with  a  finite  buffer  size, 
say  M,  in  order  to  minimize  the  average  number  of  jobs  in  queue  and  the  service  cost.  The 
state  space  of  this  problem  is  the  possible  number  of  jobs  in  the  queue  {0, 1, ... ,  M},  so  the 
size  of  the  state  space  is  M  +  l,  whereas  the  possible  actions  might  be  all  values  on  an  interval 
(e.g.,  representing  a  service  rate),  in  which  case  the  action  space  is  uncountable.  From  a 
more  general  point  of  view,  if  one  of  the  aforementioned  state  space  reduction  techniques  is 
considered,  for  instance,  say  state  aggregation,  then  MDPs  with  small  state  spaces  and  large 
action  spaces  can  also  be  regarded  as  the  outcomes  resulting  from  the  aggregation  of  MDPs 
with  large  state  and  action  spaces. 

The  issue  of  large  action  spaces  was  addressed  in  early  work  by  MacQueen  (1966),  who 
used  some  inequality  forms  of  Bellman’s  equation  together  with  bounds  on  the  optimal 
value  function  to  identify  and  eliminate  non-optimal  actions  in  order  to  reduce  the  size  of 
the  action  sets  to  be  searched  at  each  iteration  of  the  algorithm.  Since  then,  the  procedure 
has  been  applied  to  several  standard  methods  like  policy  iteration  (PI),  value  iteration  (VI) 
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and  modified  policy  iteration  (cf.  e.g.,  Puterman  1994  for  a  review).  All  of  these  algorithms 
generally  require  that  the  admissible  set  of  actions  at  each  state  is  finite.  In  this  paper,  we 
approach  this  problem  in  an  entirely  different  manner,  by  using  an  evolutionary,  population- 
based  approach,  and  directly  searching  the  policy  space  to  avoid  carrying  out  an  optimization 
over  the  entire  action  space.  Such  an  approach  has  proven  effective  in  attacking  many 
traditional  optimization  problems,  both  deterministic  (combinatorial)  and  stochastic.  Our 
work  applies  this  approach  to  infinite  horizon  discounted  cost  MDP  models  and  results  in  a 
novel  algorithm  we  call  Evolutionary  Random  Policy  Search  (ERPS). 

For  a  given  MDP  problem,  ERPS  proceeds  iteratively  by  constructing  and  solving  a 
sequence  of  sub-MDP  problems,  which  are  MDPs  defined  on  smaller  policy  spaces.  At  each 
iteration  of  the  algorithm,  two  steps  are  fundamental:  (1)  The  sub-MDP  problem  constructed 
in  the  previous  iteration  is  approximately  solved  by  using  a  variant  of  the  policy  improvement 
technique  called  policy  improvement  with  cost  swapping  (PICS),  and  a  policy  called  an  elite 
policy  is  generated.  (2)  Based  on  the  elite  policy,  a  group  of  policies  is  then  obtained  by  using 
a  “nearest  neighbor”  heuristic  and  random  sampling  of  the  entire  action  space,  from  which 
a  new  sub-MDP  is  created  by  restricting  the  original  MDP  problem  (e.g.,  cost  structure, 
transition  probabilities)  on  the  current  available  subsets  of  actions.  Under  some  appropriate 
assumptions,  we  show  that  the  sequence  of  elite  policies  converges  with  probability  one  to 
an  optimal  policy. 

We  briefly  review  the  relatively  sparse  research  literature  applying  evolutionary  search 
methods  such  as  genetic  algorithms  (GAs)  and  simulated  annealing  algorithms  (SAs)  for 
solving  MDPs.  Wells  et  al.  (1999)  use  GAs  to  find  good  finite  horizon  policies  for  partially 
observable  MDPs  and  discuss  the  effects  of  different  GA  parameters.  Barash  (1999)  pro¬ 
poses  a  genetic  search  in  policy  space  for  solving  infinite  horizon  discounted  MDPs,  and  by 
comparing  with  the  standard  policy  iteration  (PI),  concludes  that  it  is  unlikely  that  policy 
search  based  on  GAs  can  offer  a  competitive  approach  in  cases  where  PI  is  implementable. 
More  recently,  Chang  et  al.  (2002)  propose  an  algorithm  called  evolutionary  policy  itera¬ 
tion  (EPI)  for  solving  infinite  horizon  discrete  MDPs  with  large  action  spaces  by  imitating 
the  standard  procedures  of  GAs.  Although  the  algorithm  is  guaranteed  to  converge  with 
probability  one,  no  performance  comparisons  with  existing  techniques  are  provided,  and  the 
theoretical  convergence  requires  the  action  space  to  be  finite. 

ERPS  shares  some  similarities  with  the  EPI  algorithm  introduced  in  Chang  et  al.  (2002), 
where  a  sequence  of  “elite”  policies  is  also  produced  at  successive  iterations  of  the  algorithm. 
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However,  the  fundamental  differences  are  that  in  EPI,  policies  are  treated  as  the  most  es¬ 
sential  elements  in  optimization,  and  each  “elite”  policy  is  directly  generated  from  a  group 
of  policies.  On  the  other  hand,  in  our  approach,  policies  are  regarded  as  intermediate  con¬ 
structions  from  which  sub-MDP  problems  are  then  constructed  and  solved;  EPI  follows  the 
general  framework  of  GAs,  and  thus  operates  only  at  the  global  level,  which  usually  results  in 
slow  convergence.  In  contrast,  ERPS  combines  global  search  with  a  local  enhancement  step 
(the  “nearest  neighbor”  heuristic)  that  leads  to  rapid  convergence  once  a  policy  is  found  in  a 
small  neighborhood  of  an  optimal  policy.  We  argue  that  our  approach  substantially  improves 
the  performance  of  the  EPI  algorithm  while  maintaining  the  computational  complexity  at 
relatively  the  same  level. 

Perhaps  the  most  straightforward  and  the  most  commonly  used  numerical  approach  in 
dealing  with  MDPs  with  uncountable  action  spaces  is  via  the  use  of  discretization  (cf.  Rust 
1997).  In  practice,  this  could  lead  to  computational  difficulties,  either  resulting  in  an  action 
space  that  is  too  large  or  in  a  solution  that  is  not  accurate  enough.  In  contrast,  our  approach 
works  directly  on  the  action  space,  requiring  no  explicit  discretization,  and  the  adaptive  ver¬ 
sion  of  the  algorithm  we  proposed  improves  the  efficiency  of  the  search  process  and  produces 
high  quality  solutions.  As  in  standard  approaches  such  as  PI  and  VI,  the  computational 
complexity  of  each  iteration  of  ERPS  is  polynomial  in  the  size  of  the  state  space,  but  unlike 
these  procedures,  it  is  insensitive  to  the  size  of  the  action  space,  making  the  algorithm  a 
promising  candidate  for  problems  with  relatively  small  state  spaces  but  uncountable  action 
spaces. 

The  rest  of  the  paper  is  organized  as  follows.  In  Section  2,  we  begin  with  the  problem 
setting.  In  Sections  3  and  4,  we  give  a  detailed  description  of  the  ERPS  algorithm  and 
present  some  convergence  results.  In  Section  5,  we  introduce  an  adaptive  version  of  ERPS. 
Numerical  studies  and  comparisons  with  EPI  and  PI  are  reported  in  Section  6.  Finally  some 
future  research  topics  are  outlined  in  Section  7. 

2  Problem  Setting 

We  consider  the  infinite  horizon  discounted  cost  MDP  problem  G  =  (X,  A,  P ,  R,  a)  with 
finite  state  space  X,  a  general  action  space  A,  a  bounded  non-negative  cost  function  R  : 
X  x  A  — >  K+  U  {0},  a  fixed  discount  factor  a  G  (0, 1),  and  a  transition  function  P  that  maps 
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a  state-action  pair  to  a  probability  distribution  over  X.  The  probability  of  transitioning  to 
state  y  G  X  given  that  we  are  in  state  x  taking  action  a  G  A,  is  denoted  by  Px,y{a). 

Throughout  this  paper,  unless  otherwise  specified,  we  will  always  denote  the  set  of  all 
stationary  deterministic  policies  n  :  X  — >  A  by  II  and  the  size  of  the  state  space  by  |A"|. 
And  without  lost  of  generality,  we  assume  that  all  actions  a  G  A  are  admissible  for  all  states 

x  e  x. 

Define  the  optimal  value  associated  with  an  initial  state  x  as 
J*(x)  =  inf^n  Jn(x),  where 

Jn(x)  =  ^E~0Q!t-R(xt,7r(xt))|xo  =  x\ ,  x  G  X,  a  G  (0, 1), 

Xt  is  the  state  of  the  MDP  at  time  t,  and  E(-)  is  understood  with  respect  to  the  sample 
space  induced  by  the  transition  probabilities.  Assume  there  exists  a  stationary  optimal 
policy  7i*  G  II  that  achieves  the  optimal  value  J*(x)  for  all  initial  states  x  G  X.  The 
problem  is  to  find  such  a  policy  7 r*. 

3  Evolutionary  Random  Policy  Search 

A  high-level  description  of  the  ERPS  algorithm  is  summarized  in  Figure  1.  We  will  provide 
a  detailed  discussion  in  the  following  subsections. 

3.1  Initialization 

We  start  by  specifying  an  action  selection  distribution  V,  the  exploitation  probability  q0  G 
[0, 1],  the  population  size  n,  and  a  search  range  rt  for  each  state  Xi  G  X.  Once  chosen,  these 
parameters  are  fixed  throughout  the  algorithm.  We  then  select  an  initial  group  of  policies; 
however,  because  of  the  exploration  step  used  in  ERPS,  the  performance  of  the  algorithm  is 
relatively  insensitive  to  this  choice.  One  simple  method  is  to  choose  each  individual  policy 
uniformly  from  the  policy  space  II. 

The  action  selection  distribution  V  is  &  probability  distribution  over  the  action  space,  and 
will  be  used  to  generate  sub-MDPs  (cf.  Section  3.3).  Note  that  V  could  be  state  dependent  in 
general,  i.e.,  we  could  prescribe  for  each  state  x  G  X  a  different  action  selection  distribution 
according  to  some  prior  knowledge  of  the  problem  structure.  One  simple  choice  of  V  is 
the  uniform  distribution.  The  exploitation  probability  qo  and  the  search  range  rq  will  be 
used  to  construct  sub-MDPs;  the  detailed  discussion  of  these  two  parameters  is  deferred  to 
Section  3.3. 
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Evolutionary  Random  Policy  Search  (ERPS) 

•  Initialization:  Specify  an  action  selection  distribution  V ,  the  population  size  n  >  1,  and  the 
exploitation  probability  go  G  [0, 1].  Specify  a  search  range  r,  for  each  state  Xi  £  X,  i  =  1, . . . ,  |X|. 
Select  an  initial  population  of  policies  A0  =  {71-°,  7^, . . . ,  7r°  } .  Construct  the  initial  sub-MDP  as 
Qa0  '■=  (■ X,T0,P,R,a ),  where  T0  =  U*  A0(a;).  Set  :=  7r?,  k  =  0. 

•  Repeat  until  a  specified  stopping  rule  is  satisfied: 


Policy  Improvement  with  Cost  Swapping  (PICS): 

k  7 

*  Obtain  the  value  function  J  i  for  each  n*  £  A*,. 

*  Generate  the  elite  policy  for  Q\k  as 

7 r*  (a;)  =  arg  min  <  R(x,  u )  +  a  Px  y  ( u )  [  min  JA/  (y)]  >  ,  \/x  £  X. 

ueAk(x)  {  y  '  ^eAk  '  J 

Sub-MDP  Generation: 

*  for  j  =  2  to  n 

for  i  =  1  to  |X| 

generate  a  random  sample  u  from  the  uniform  distribution  over  [0,1], 
if  u  <  qo  (exploitation) 

choose  the  action  7 Tj+1(xi)  in  the  neighborhood  of  7r *(#i) 
by  using  the  “nearest  neighbor”  heuristic, 
elseif  u  >  go  (exploration) 

choose  the  action  nj+1(xi)  €  A  according  to  V. 

end  if 
end  for 
end  for 

*  Set  the  next  population  generation  as  A^+i  =  {71^,  7r2+1,  •  •  •  ,7r^+1}. 

*  Construct  a  new  sub-MDP  problem  as  Q\k+1  :=  (X,  Tfc+i,  P,  R,  a), 
where  rfc+i  =  (Jx  Afc+1(a;). 

*  k  <—  k  +  1. 


Figure  1:  Evolutionary  Random  Policy  Search 


3.2  Policy  Improvement  with  Cost  Swapping 

As  mentioned  earlier,  the  idea  behind  ERPS  is  to  randomly  split  a  large  MDP  problem  into  a 
sequence  of  smaller,  manageable  MDPs,  and  to  extract  a  possibly  convergent  sequence  of  poli- 
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cies  via  solving  these  smaller  problems.  For  a  given  policy  population  A  =  {7Ti,  7r2, . . . ,  7 rn}, 
if  we  restrict  the  original  MDP  (e.g.,  costs,  transition  probabilities)  on  the  subsets  of  actions 
A(x)  :=  (7Ti(x),  7t2(x), . . . ,  7rn(x)}  Vi  6  I,  then  a  sub-MDP  problem  is  induced  from  A  as 
Q,\  :=  (X,  T,  P,  R,  a),  where  V  :=  (J.t.  A(x).  Note  that  in  general  A(x)  is  a  multi-set,  which 
means  that  the  set  may  contain  repeated  elements;  however,  we  can  always  discard  the  re¬ 
dundant  members  and  view  A(x)  as  the  set  of  admissible  actions  at  state  x.  Since  ERPS  is 
an  iterative  random  search  algorithm,  rather  than  attempting  to  solve  Q\  exactly,  it  is  more 
efficient  to  utilize  some  approximation  schemes  and  to  obtain  an  improved  policy  and/or 
good  candidate  policies  with  some  worst-case  performance  guarantee. 

Here  we  adopt  a  variant  of  the  policy  improvement  technique  to  find  an  “elite”  policy, 
one  that  is  superior  to  all  of  the  policies  in  the  current  population,  by  executing  the  following 
two  steps: 

Step  1  :  Obtain  the  value  functions  Jnj ,  j  —  1, . . .  ,n,  by  solving  the  equations: 

Jnj(x)  =  R(x,  nj(x))  +  a  ^  PXty(nj(x))  Jnj(y),  VxG  Ah  (2) 

y 

Step  2:  Compute  the  elite  policy  7r*  by 

7T*  (x)  =  arg  min  <  R{x ,  u)  +  a  >  Px  y{u)  [min  J7Tj  (?/)]>,  VxeX.  (3) 

«6A(x)  [  y  ^eA  J 

Since  in  equation  (3),  we  are  basically  performing  the  policy  improvement  on  the  “swapped 
cost”  miring  a  J71'1  (. x ),  we  call  this  procedure  “policy  improvement  with  cost  swapping”  (PICS). 
We  remark  that  the  “swapped  cost”  miring  a  .W'  (x)  may  not  be  the  value  function  corre¬ 
sponding  to  any  policy.  We  now  show  that  the  elite  policy  generated  by  PICS  improves  any 
policy  in  A. 

Theorem  1  Given  A  =  {7Ti,  7t2,  -  -  - ,  vrn}7  let  J(x)  =  miring  a  JWj(x)  VxG  X ,  and  let 

n(x)  =  arg  min  <  R(x,u)  +  a  Px,y(u(x))J(y) 
ugA(x)  [  y 

Then  J^l{x)  <  J(x),  VxG  X .  Furthermore,  if  n  is  not  optimal  for  Q\,  then  JA‘(x)  <  J(x) 
for  at  least  one  x  G  X . 

Proof:  Let  Jo(x)  =  R(x,  /i(x))+a  'f2]J  PXty(n(x))J(y),  and  consider  the  sequence  Ji(x),  J2(x), 
generated  by  the  recursion  Ji+1(x)  =  R(x,  /r(x))  +  a  )T)y  PXty(n(x))  Ji(y) ,  Vi  =  0, 1,  2, . . ..  At 
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state  x,  by  definition  of  J(x),  there  exists  ix3  such  that  J(x)  =  Jnj(x).  It  follows  that 


Jo(x)  <  R(x,  7Tj(x))  +  a  PX,y(^j(x))J (y) 

<  R(x,nj(x))  +  a'52yPx,y(nj(x))J7r*(y) 

=  Jnj  (x) 

=  j(x)  , 

and  since  x  is  arbitrary,  we  have 

Ji(x)  =  R(x,fi(x))  +  aJ2yPx,y(Kx))Jo(y) 

<  R(x,fi(x))  +  a'52yPXiy((j,(x))J(y) 

=  Jo(x)  . 

By  induction  Ji+1(x)  <  Ji(x),  Vx  G  X  and  Vi  =  0, 1,  2, . . ..  It  is  well  known  (Bertsekas 
1995)  that  the  sequence  Jo(x),  Ji(x),  J2(x), . . .  generated  above  converges  to  J^{x),  Vx  G  X. 
Therefore  J^l(x)  <  J(x),  V x.  If  JM(x)  =  J(x),  Vx  G  X,  then  PICS  reduces  to  the  standard 
policy  improvement  on  policy  n,  and  it  follows  that  y  satisfies  the  Bellman’s  optimality 
equation  and  is  thus  optimal  for  Q,\ .  Hence  we  must  have  J,l{x)  <  J(x)  for  some  x  G  X 
whenever  //  is  not  optimal.  ■ 

Now  at  the  kth  iteration,  given  the  current  policy  population  Ak,  we  compute  the  kth 
elite  policy  7 r*  via  PICS.  According  to  Theorem  1,  the  elite  policy  improves  any  policy  in 
Afc,  and  since  7r*  is  directly  used  to  generate  the  (k  +  1  )th  sub-MDP  (cf.  Figure  1  and 
Section  3.3),  the  following  monotonicity  property  is  immediately  clear: 

Corollary  1  For  all  k  >  0, 

r*+\x)  <  Jn*{x),  VxeX. 


Proof:  Follows  by  induction.  ■ 

In  EPI,  an  “elite”  policy  is  also  obtained  at  each  iteration  by  a  method  called  “policy 
switching”.  Unlike  PICS,  policy  switching  constructs  an  elite  policy  by  directly  manipulating 
each  individual  policy  in  the  population.  To  be  precise,  for  the  given  policy  population 
A  =  { 7r ! ,  7t2,  -  -  - ,  7 rn},  the  elite  policy  is  constructed  as 

7 t*(x)  G  <  argmin(J7ri(x))(x)  >,  VxGl,  (4) 

l  TTiSA  J 

where  the  value  functions  ,/7r'1 ,  V  7T*  G  A  are  obtained  by  solving  equation  (2).  It  is  proven  in 
Chang  et  al.  (2002)  that  the  elite  policy  7 r*  generated  by  policy  switching  also  improves  any 


policy  in  the  population  A.  Note  that  the  computational  complexity  of  executing  equation  (4) 
is  0(n\X\). 

We  now  provide  a  heuristic  comparison  between  PICS  and  policy  switching.  For  a  given 
group  of  policies  A,  let  O  be  the  policy  space  for  the  sub-MDP  Q\\;  it  is  clear  that  the  size 
of  0  is  on  the  order  of  'nJA'L  Policy  switching  only  takes  into  account  each  individual  policy 
in  A,  while  PICS  tends  to  search  the  entire  space  Q,  which  is  a  much  larger  set  than  A. 
Although  it  is  not  clear  in  general  that  the  elite  policy  generated  by  PICS  improves  the  elite 
policy  generated  by  policy  switching,  since  the  policy  improvement  step  is  quite  fast  and 
it  focuses  on  the  best  policy  updating  directions,  we  believe  this  will  be  the  case  in  many 
situations.  For  example,  consider  the  case  where  one  particular  policy,  say  7f,  dominates  all 
other  policies  in  A.  It  is  obvious  that  policy  switching  will  choose  7 f  as  the  elite  policy;  thus 
no  further  improvement  can  be  achieved.  In  contrast,  PICS  considers  the  sub-MDP  as 
long  as  7 f  is  not  optimal  for  a  better  policy  can  always  be  obtained. 

The  computational  complexity  of  each  iteration  of  PICS  is  approximately  the  same  as 
that  of  policy  switching,  because  step  1  of  PICS,  i.e.,  equation  (2),  which  is  also  used  by 
policy  switching,  requires  solution  of  n  systems  of  linear  equations,  and  the  number  of 
operations  required  by  using  a  direct  method  (e.g.,  Gaussian  Elimination)  is  0(n|A^|3),  and 
this  dominates  the  cost  of  step  2,  which  is  at  most  0(n |A"|J). 

3.3  Sub-MDP  Generation 

The  description  of  the  “sub-MDP  generation”  step  in  Figure  1  is  only  at  a  conceptual  level. 
In  order  to  elaborate,  we  need  to  distinguish  between  two  cases.  We  first  consider  the  discrete 
action  space  case;  then  we  discuss  the  setting  where  the  action  space  is  continuous. 

3.3.1  Discrete  Action  Spaces 

According  to  Corollary  1,  the  performance  of  the  elite  policy  at  the  current  iteration  is 
no  worse  than  the  performances  of  the  elite  policies  generated  at  previous  iterations.  Our 
concern  now  is  how  to  achieve  continuous  improvements  among  the  elite  policies  found 
at  consecutive  iterations.  One  possibility  is  to  use  unbiased  random  sampling  and  choose 
at  each  iteration  a  sub-MDP  problem  by  making  use  of  the  action  selection  distribution 
V.  The  sub-MDPs  at  successive  iterations  are  then  independent  of  one  another,  and  it 
is  intuitively  clear  that  we  may  obtain  improved  elite  policies  after  a  sufficient  number  of 
iterations.  Such  an  unbiased  sampling  scheme  is  very  effective  in  escaping  local  optima 
and  is  often  useful  in  finding  a  good  candidate  solution.  However,  in  practice  persistent 
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improvements  will  be  more  and  more  difficult  to  achieve  as  the  number  of  iterations  (sampling 
instances)  increases,  since  the  probability  of  hireling  better  elite  policies  becomes  smaller  and 
smaller.  See  Lourenco,  Martin,  and  Stiitzle  (2001)  for  a  more  insightful  discussion  in  a  global 
optimization  context.  Thus,  it  appears  that  a  biased  sampling  scheme  could  be  more  helpful, 
which  can  be  accomplished  by  using  a  “nearest  neighbor”  heuristic. 

To  achieve  a  biased  sampling  configuration,  ERPS  combines  exploitation  (“nearest  neigh¬ 
bor”  heuristic)  with  exploration  (unbiased  sampling).  The  key  to  balance  these  two  types  of 
searches  is  the  use  of  the  exploitation  probability  go-  For  a  given  elite  policy  7 r,  we  construct 
a  new  policy,  say  7r,  in  the  next  population  generation  as  follows:  At  each  state  x  G  X , 
with  probability  g0,  n(x)  is  selected  from  a  small  neighborhood  of  and  with  probability 
1  —  go,  fr(x)  is  chosen  by  using  the  unbiased  random  sampling.  The  preceding  procedure  is 
performed  repeatedly  until  we  have  obtained  n  —  1  new  policies,  and  the  next  population 
generation  is  simply  formed  by  the  elite  policy  n  and  the  n  —  1  newly  generated  policies. 
Intuitively,  on  the  one  hand,  the  use  of  exploitation  will  introduce  more  robustness  into 
the  algorithm  and  helps  to  locate  the  exact  optimal  policy,  while  on  the  other  hand,  the 
exploration  step  will  help  the  algorithm  to  escape  local  optima  and  to  find  attractive  poli¬ 
cies  quickly.  In  effect,  we  see  that  this  idea  is  equivalent  to  altering  the  underlying  action 
selection  distribution,  in  that  V  is  artificially  made  more  peaked  around  the  action  7r(a;). 

If  we  assume  that  A  is  a  non-empty  metric  space  with  a  defined  metric  d(-,  •),  then  the 
“nearest  neighbor”  heuristic  in  Figure  1  could  be  implemented  as  follows: 

Let  r,,  a  positive  integer,  be  the  search  range  for  state  xi:  i  =  1,  2, ... ,  |X|.  We  assume 
that  r'i  <  \A\  for  all  i ,  where  |A|  is  the  size  of  the  action  space. 

•  Generate  a  random  variable  l  ~  DU(l,ri ),  where  _D[/(l,r,)  represents  the  discrete 
uniform  distribution  between  1  and  rt .  Set  7 Tj+1(xi)  =  a  G  A  such  that  a  is  the  Ith 
closest  action  to  7r*(ay)  (measured  by  d{-,  •)). 

Remark  1:  Sometimes  the  above  procedure  is  not  easy  to  implement.  It  is  often  necessary 
to  index  a  possibly  high-dimensional  metric  space,  whose  complexity  will  depend  on  the 
dimension  of  the  problem  and  the  cost  in  evaluating  the  distance  functions.  However,  we 
note  that  the  action  spaces  of  many  MDP  problems  are  subsets  of  5ftA'r,  where  a  lot  of  efficient 
methods  can  be  applied,  such  as  Kd-trees  (Bentley  1979)  and  R-trees  (Guttman  1984).  The 
most  favorable  situation  is  an  action  space  that  is  “naturally  ordered”,  e.g.,  in  inventory 
control  problems  where  actions  are  the  number  of  items  to  be  ordered  A  =  {0, 1,  2,  •  •  • },  in 
which  case  the  indexing  and  ordering  becomes  trivial. 

Remark  2:  In  EPI,  policies  in  a  new  generation  are  generated  by  the  so-called  “policy 
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mutation”  procedure,  where  two  types  of  mutations  are  considered:  “global  mutation”  and 
“local  mutation” .  The  algorithm  first  decides  whether  to  mutate  a  given  policy  it  “globally” 
or  “locally”  according  to  a  mutation  probability  Pm.  Then  at  each  state  x,  n(x)  is  mutated 
with  probability  Pg  or  Pi,  where  Pg  and  Pi  are  the  respective  predefined  global  mutation  and 
local  mutation  probabilities,  ft  is  assumed  that  Pg  3>  Pf,  the  idea  is  that  “global  mutation” 
helps  the  algorithm  to  get  out  of  local  optima  and  “local  mutation”  helps  the  algorithm 
to  fine-tune  the  solution.  If  a  mutation  is  to  occur,  the  action  is  changed  by  using  the 
action  selection  probability  V.  As  a  result,  we  see  that  each  action  in  a  new  policy  generated 
by  “policy  mutation”  either  remains  unchanged  or  is  altered  by  pure  random  sampling; 
although  the  so-called  “local  mutation”  is  used,  no  local  search  element  is  actually  involved 
in  the  process.  Thus,  as  mentioned  before,  the  algorithm  only  operates  at  the  global  level. 
We  note  that  this  is  essentially  equivalent  to  setting  the  exploitation  probability  go  =  0  in 
our  approach. 

3.3.2  Continuous  Action  Spaces 

The  biased  sampling  idea  in  the  previous  section  can  be  naturally  extended  to  MDPs  with 
continuous  action  spaces.  We  let  Ba  be  the  smallest  u-algebra  containing  all  the  open  sets 
in  A,  and  choose  the  action  selection  distribution  V  as  a  probability  measure  defined  on 
(A,  Ba)-  Again,  denote  the  metric  defined  on  A  by  d(-,  •). 

By  following  the  “nearest  neighbor”  heuristic,  we  now  give  a  general  implementation  of 
the  exploitation  step  in  Figure  1. 

Let  rj  >  0  denote  the  search  range  for  state  Xi,  i  =  1,  2, . . . ,  |X|. 

•  Choose  an  action  uniformly  from  the  set  of  neighbors  {a  :  d(a,TTk(xi))  <  rg,  a  G  A}. 

Note  the  difference  in  the  search  range  rg  between  the  discrete  action  space  case  and 
the  continuous  action  space  case.  In  the  former  case,  rt  is  a  positive  integer  indicating  the 
number  of  candidate  actions  that  are  the  closest  to  the  current  elite  action  tt k{xi),  whereas  in 
the  latter  case,  rg  is  the  distance  from  the  current  elite  action,  which  may  take  any  positive 
value. 

If  we  further  assume  that  A  is  a  non-empty  open  connected  subset  of  3?^  with  some 
metric  (e.g.,  the  infinity-norm),  then  a  detailed  implementation  of  the  above  exploitation 
step  is  as  follows. 

•  Generate  a  random  vector  A*  =  (A), . . . ,  XlN)T  with  each  \lh  ~  U[—  1, 1]  independent  for 
all  h  —  1,  2, . . . ,  N,  and  choose  the  action  irk+1{xi)  =  nk(xi)  +  A Vj. 
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•  If  7r|+1(a:i)  f  A,  then  repeat  the  above  step. 

In  this  specihc  implementation,  the  same  search  range  rl  is  used  along  all  directions  of  the 
action  space.  However,  in  practice,  it  may  often  be  useful  to  generalize  r%  to  a  ./V-dimensional 
vector,  where  each  component  controls  the  search  range  in  a  particular  direction  of  the  action 
space. 

Remark  3:  Note  that  the  action  space  does  not  need  to  have  any  structure  other  than  being 
a  metric  space.  The  metric  d(-,  •)  used  in  the  “nearest  neighbor”  heuristic  implicitly  imposes 
a  structure  on  the  action  space.  It  follows  that  the  efficiency  of  the  algorithm  depends  on 
how  the  metric  is  actually  defined.  Like  most  of  the  random  search  methods  for  global 
optimizations,  our  approach  is  designed  to  explore  the  structure  that  good  policies  tend  to 
be  clustered  together.  Thus,  in  our  context,  a  good  metric  should  have  a  good  potential  in 
representing  this  structure.  For  example,  the  discrete  metric  (i.e.,  d(a,a )  =  0  V  a  G  A  and 
d(a,  b)  =  1  V  a,  b  e  A,  a  ^  b)  should  never  be  considered  as  a  good  choice,  since  it  does  not 
provide  us  with  any  useful  information  about  the  action  space.  For  a  given  action  space, 
a  good  metric  always  exists  but  may  not  be  known  a  priori.  In  the  special  case  where  the 
action  space  is  a  subset  of  5fhY,  we  take  the  Euclidean  metric  as  the  default  metric,  this  is  in 
accord  with  most  of  the  optimization  techniques  employed  in  dtN . 

3.4  Stopping  Rule 

Different  stopping  criteria  can  be  used.  The  simplest  one  is  to  stop  the  algorithm  when  a 
predefined  maximum  number  of  iterations  is  reached.  In  the  numerical  experiments  reported 
in  Section  6,  we  use  one  of  the  most  common  stopping  rules  in  standard  GAs  (Srinivas  and 
Patnaik  1994;  Wells  et  al.  1999;  Chang  et  al.  2002):  the  algorithm  is  stopped  when  no  further 
improvement  in  the  value  function  is  obtained  for  several,  say  K,  consecutive  iterations.  To 
be  precise,  we  stop  the  algorithm  if  3  k  >  0,  such  that  ||  Jn*+  —  Jn*  ||  =  0  V  m  —  1,  2, . . . ,  K. 

4  Convergence  of  ERPS 

In  this  section,  we  discuss  the  convergence  properties  of  ERPS.  To  do  so,  the  following 
notation  is  necessary. 

As  before,  denote  by  d(-,  •)  the  metric  on  the  action  space  A.  We  define  the  distance 
between  two  policies  7T1  and  n2  by 

doo(7r\7T2)  :=  max  d{7T1(xi),n2(xi)). 
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For  a  given  policy  n  G  II  and  any  a  >  0,  we  further  define  the  cr-neighborhood  of  n  by 

J\f( 7T,  &)  :=  { 7T |  doo^TT,  7r)  <  <7,  V  7T  G  11}  . 

For  each  policy  7T  G  II,  we  also  define  P \  as  the  transition  matrix  whose  (x,y)th  entry  is 
Px,y{p(x))  and  Rn  as  the  one-stage  cost  vector  whose  {x)th  entry  is  R(x,tt(x)). 

As  the  ERPS  method  is  randomized,  different  runs  of  the  algorithm  will  give  different 
sequences  of  elite  policies  (i.e.,  sample  paths);  thus  the  algorithm  induces  a  probability 
distribution  over  the  set  of  all  sequences  of  elite  policies.  We  denote  by  V(-)  and  E(-)  the 
probability  and  expectation  taken  with  respect  to  this  distribution. 

Let  ||  ■  || oo  denote  the  infinity-norm,  given  by  ||J||oo  :=  nrax^gx  |J(x)|.  We  have  the 
following  convergence  result  for  the  ERPS  algorithm. 

Theorem  2  Let  7 r*  be  an  optimal  policy  with  corresponding  value  function  Jn* ,  and  let  the 
sequence  of  elite  policies  generated  by  ERPS  together  with  their  corresponding  value  functions 
be  denoted  by  {zr^,  k  —  1,2, . . .}  and  {J77*,  k  —  1,2, . . respectively.  Assume  that: 

1.  q0  <  1. 

2.  For  any  given  t  >  0,  V({a\  dfa,  tt*(x))  <£,  a  G  A})  >0,  Vi  6  X,  (recall  that  'P(-)  is 
a  probability  measure  on  the  action  space  A). 

3.  There  exist  constants  a  >  0,  (f)  >  0,  L\  <  00,  and  L2  <  00,  such  that  for  all  n  G 

Af(n*,cr)  we  have  \\Pn  —  iVHoo  <  min  {LidootA; tt*),  —  0}  (0  <  a  <  1),  and 

||  Rn  Rn*  1 1 00  ^  L2d00(7r,  7T  ). 

Then  for  any  given  e  >  0,  there  exists  a  random  variable  A4e  >  0  with  E(M.£)  <  00  such 
that  ||  Jn*  —  J^||oo  <  £  V  k  >  M.e. 

Remark  4:  Assumption  1  restricts  the  exploitation  probability  from  pure  local  search. 
Assumption  2  simply  requires  that  any  “ball”  that  contains  the  optimal  policy  will  have  a 
strictly  positive  probability  measure.  It  is  trivially  satisfied  if  the  set  {a\d(a,  vr*(a;))  <  £,  a  G 
A}  has  a  positive  (Borcl)  measure  V  x  G  X  and  the  action  selection  distribution  V  has 
infinite  tails  (e.g.,  Gaussian,  exponential).  Assumption  3  imposes  some  Lipschitz  type  of 
conditions  on  Pn  and  Rn ;  as  we  will  see,  it  formalizes  the  notion  that  near  optimal  policies 
are  clustered  together  (cf.  remark  3).  The  assumption  can  be  verified  if  Pn  and  Rn  are 
explicit  functions  of  it  (which  is  the  case  in  our  numerical  examples;  cf.  Section  6).  For  a 
given  £  >  0,  such  a  policy  7 r  satisfying  ||  J71-  —  J77*^  <  £  is  referred  to  as  an  ^-optimal  policy 
(cf.  Bertsekas  1995). 
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Remark  5:  The  result  in  Theorem  2  implies  the  a.s.  convergence  of  the  sequence  { Jn* ,  k  = 
0, 1, . . .}  to  the  optimal  value  function  P1"*.  To  see  this,  note  that  Theorem  2  implies  that 
P(||  J7r*  —  Jn  ||oo  >  £)  — 1 ►  0  as  k  — >  cx)  for  every  given  £,  which  means  that  the  sequence 
converges  in  probability.  Furthermore,  since  ||  Jn*  —  Jn  ||oo  A  £  V  k  >  A4£  is  equivalent  to 
supfc>fc  ||  Jn*  —  J^Hoo  <e\/k>  A4S,  we  will  also  have  V{ sup^>fe  \\Jn*  —  J^’Hoo  >  e)  — >  0  as 
k  — >  cx),  and  the  a.s.  convergence  thus  follows. 

Proof:  The  first  step  in  the  proof  is  to  derive  an  npperbonnd  for  \\Jn  —  Jn*\\ oc  in  terms  of 


the  distance  d^TT,  tt*).  For  policy  7 r*  and  policy  tt  we  have: 

Jn  =  Rn*  +aP7V*J^  ,  (5) 

r  =  Rn  +  aPvJ*.  (6) 

Now  subtract  the  above  two  equations  and  define  A  Jn*  =  Jn  —  Pr*,  A P^*  =  Pw  —  Pn* 
and  A Rn*  =  Rn  —  Rn* .  We  have 

A r*  =  [/-(/-  -  aPn*)~\aAPn*Jn*  +  A Rn*).  (7) 


Taking  the  norm  of  both  sides  of  (7)  and  using  the  consistency  property  of  the  operator 
norm  (i.e.,  ]|AB||  <  ||A||  •  ||P||  ),  it  follows  that 

llAJ^Hoo  <  ||[J-(/-aP^~1aAPu*]~1||oJ(J-aP7r.0~1||ooH|APvr*||oo||Pr*||oo  +  ||AP7r*||oo). 

(8) 

By  assumption  3,  we  have  UAP^Uoo  <  thus 

||  (/  -  aPff.)_1aAP^.||00  <  ||(/-q'Pw.)_1||o0q:||APw.||00 

<  ||(J  -  aP7r.)_1||oo(l  -  a) 

<  1. 

We  now  try  to  divide  both  sides  of  equation  (8)  by  ||  J7r*  ||oo-  Before  we  proceed,  we  need  to 
distinguish  between  two  cases,  HP^Hoo  =  0  and  HP^Hoo  7^  0. 

Case  1.  If  Rn*  =  0  (i.e.,  R(x,tt*(x))  =  0  for  all  x  G  A),  then  we  have  Jn *  =  0.  Thus 
A  Jn*  =  Jn  and  A Rn*  =  Rn.  By  noting  1 1 P^  1 1  oc  =  1,  it  follows  from  (6)  that 

HAP1"  || oo  =  HJloc  A  T - 11  p  11 - UPttIIoo  =  t - IIAP^Hoo- 

1  a||  1 7r  1 1 00  1  ^ 

Then  by  assumption  3, 

||AJ7r‘||00<-^d00(7r,7r*).  (9) 

1  —  a 
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Case  2.  If  Rw*  >  0  (i.e. ,  R(x,  n*(x))  >  0  for  some  x  e  X ),  then  from  (5),  J77*  >  0.  Divide 
both  sides  of  (8)  by  || «/7r* || oo5  use  the  relation  that  || (/  —  _B)_1||  <  1_yB||  whenever  ||P||  <  1 
and  the  consistency  property;  it  immediately  follows  that 


||  AJ71 


I  J71 


< 


||  (/  —  aPn*) 


-n 


< 


< 


1  -  \\(I  ~  aP^lUaWAP^l 

||(I  —  Qp7r*)"1||00||/  —  aPn*  ||, 

1  -  \\{I  -  aP^-^aWAP^l 
K  f  a\\XPn*  Hoc 


1  -/Ci 


IIAP^I 


I-aPn* 

X 


^11  AP-tt*  ||  oo 


where  /C  =  ||(J  —  aPn*) 


\I-aPw* 

-in  n 


-  aPn*  I 
cx.L\ 


1 1  -  aPw 


+ 


+ 


allAP*.  ||  oo  + 
qIIAP^Hoo 

||  I  (xPn*  ||  oo 

||  XRn*  ||c 


II  AP„ 


U7' 


+ 


||APJ[ 


1 1  -  « /V- 1 


I  J77*! 


|  Rtt*  I 

L‘2 


||P, 


^oo(7T,  7T*), 


(10) 


oo 


/  —  aPn 


In  view  of  (9)  and  (10),  we  conclude  that  for  any  given  £  >  0,  there  exists  a  9  >  0  such 
that  for  any  7 r  G  X(n*,  cr)  where 

doo(7r,  7r*)  :=  max  d(n(xi),  n*(xi))  <  9, 

1<*<|X| 

we  have  ||  J77— J77*  Hoc  =  ||  AJ77*  <  e.  Note  that  max!<j<|x|  d(7r(xi),  n *(x*))  <9  is  equivalent 

to 

d(Tr(xi),n*(xi))  <9,  V  i  =  1,2, ,  |A|.  (11) 

By  assumption  2,  the  set  of  actions  that  satisfies  (11)  will  have  a  strictly  positive  proba¬ 
bility  measure,  and  since  qo  <  1,  it  follows  that  the  probability  a  population  generation 
does  not  contain  a  policy  in  the  neighborhood  «A/"(7t*,  min  {9,  <r})  of  the  optimal  policy  is 
strictly  less  than  1.  Let  ^  be  the  probability  that  a  randomly  constructed  policy  is  in 
min  {9,  a}).  Then  at  each  iteration,  the  probability  that  at  least  one  policy  is  ob¬ 
tained  in  A/"(7t*,  min  {9,  a})  is  1  —  (1  —  ^)n_1,  where  n  is  the  population  size.  Thus,  the 

initial  part  of  the  proof  together  with  Theorem  1  implies  that  the  value  function  of  the  elite 

policy  obtained  in  the  next  iteration  is  £-optimal  at  least  with  probability  1  —  (1  —  ip)11-1. 
Let  XI  s  denote  the  number  of  iterations  required  to  generate  such  an  elite  policy  for  the 
first  time.  By  the  monotonicity  of  the  sequence  {J77*,  k  =  0,1,...}  (cf.  Corollary  1),  it 
is  clear  that  H-/77*  —  J77  ||oo  A  £  V  k  >  Xl£.  Now  consider  a  random  variable  Xt  that  is 

geometrically  distributed  with  a  success  probability  of  1  —  (1  —  ^)n  l.  It  is  not  difficult  to 

see  that  Xi  dominates  XI  s  stochastically  (i.e.,  Xi  >st  XI s),  and  because  i/j  >  0,  it  follows 
that  E(Xle)  <  E(M)  =  y _ ( ‘y y  —  i  <  °°-  ■ 
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Remark  6:  In  the  above  proof,  we  have  used  the  infinity- norm.  Since  in  finite  dimen¬ 
sional  spaces  all  norms  are  equivalent  (cf.  Demmel  1997),  similar  results  can  also  be  easily 
established  by  using  different  norms,  e.g.,  the  Euclidean-norm. 

Remark  7 :  It  should  be  noted  that  the  result  presented  in  Theorem  2  is  rather  theoretical, 
because  nothing  can  be  said  about  the  convergence  rate  of  the  algorithm  as  well  as  how  much 
improvement  can  be  achieved  at  each  iteration.  As  a  consequence,  the  random  variable  A4e 
could  be  extremely  large  in  practice. 

Note  that  for  a  discrete  finite  action  space,  assumption  3  in  Theorem  2  is  automatically 
satisfied,  and  assumption  2  also  holds  trivially  if  we  take  V(a)  >  0  for  all  actions  a  G  A. 
Furthermore,  when  the  action  space  is  finite,  there  always  exists  an  e  >  0  such  that  the  only 
^-optimal  policy  is  the  optimal  policy  itself.  We  have  the  following  stronger  convergence 
result  for  ERPS  when  the  action  space  is  finite. 

Corollary  2  (Discrete  finite  action  space)  If  the  action  space  is  finite,  q0  <  1,  and  the 
action  selection  distribution  Via)  >  0  Va  e  A,  then  there  exists  a  random  variable  M.  >  0 
such  that  V{M.  <  oo)  =  1  and  E{M.)  <  oo,  and  -V*  =  J 7r*  V  k  >  M.. 

5  Adaptive  ERPS 

The  search  range  parameter  rt  in  ERPS  is  fixed  throughout  the  algorithm.  Intuitively,  small 
search  ranges  concentrate  the  search  in  small  regions  around  the  desirable  points  and  are 
helpful  in  refining  promising  solutions,  but  they  often  lead  to  small  improvements  in  the  cost 
function,  thus  slowing  down  the  convergence  process.  On  the  other  hand,  large  search  ranges 
typically  reduce  the  number  of  search  steps  needed  to  find  a  good  or  near  optimal  solution, 
but  can  be  less  effective  in  developing  finer  details  around  desirable  points  and  may  result 
in  less  accurate  solutions.  In  this  section,  we  present  a  modification  of  the  ERPS  method 
in  which  the  value  of  the  search  range  parameter  may  change  from  one  iteration  to  another. 
The  idea  is  to  adaptively  shrink  and  expand  the  search  range  so  that  we  can  speed  up  the 
convergence  process  without  sacrificing  the  solution  quality.  A  detailed  description  of  the 
adaptive  ERPS  is  given  in  Figure  2,  where  we  only  consider  the  continuous  action  space 
case;  the  discrete  action  space  version  can  be  constructed  similarly. 

We  start  by  running  ERPS  with  an  initially  specified  search  range  r  (for  simplicity,  we 
assume  that  the  same  search  range  is  prescribed  for  all  states),  and  monitor  the  performance 
of  the  elite  policy  obtained  at  each  iteration.  If  no  improvements  among  the  elite  policies 
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Adaptive  ERPS 

•  Initialization:  Specify  an  initial  search  range  r,  parameters  K,  1  <  I\\  <  K, 
A2  >  1,  A3  >  1,  7  >  1  and  a  tolerance  level  e  >  0,  where  K  is  the  stopping  control 
parameter  as  in  ERPS.  Set  i  <—  0,  j  *—  0,  and  h  4—  0. 

•  while  (i  <  K  &  h  <  A'3) 

—  Execute  ERPS  with  search  range  r. 

—  Search  range  update: 

if  0  <  ||  Jn*+1  —  Jn*  ||  <  e,  then  set  *  <—  0,  j  <—  j  +  1; 
elseif  ||  J%*+1  —  J ||  =  0,  then  set  i  <—  i  +  1,  j  <—  0; 
else  set  *  <—  0,  j  <—  0. 

end  if 

if  i  >  I\ i,  then  set  r0id  <—  r,  r  4—  r  ■  end  if 

if  j  >  K2,  then  set  r  <—  r  •  7.  end  if 

if  r  =  r0id,  then  set  h  <—  h  +  1;  else  set  h  <—  0.  end  if 

end  while 


Figure  2:  Adaptive  ERPS 


are  achieved  for  several,  say  Ad,  consecutive  iterations,  then  it  indicates  that  the  current 
search  range  may  be  too  large,  and  we  decrease  it  by  a  factor  7  >  1.  On  the  other  hand, 
if  for  some  consecutive  iterations,  say  K2j  the  improvements  are  non- zero  but  smaller  than 
some  given  tolerance  £,  then  it  is  likely  that  the  current  search  range  is  too  small,  and  we 
increase  it  by  7  until  the  improvement  is  greater  than  the  specified  tolerance  level.  The 
search  range  is  updated  repeatedly  until  it  has  been  alternating  between  two  values  for  A'3 
times.  Intuitively,  the  adaptive  ERPS  ensures  that  each  improvement  in  the  elite  policy  is 
(approximately)  at  least  e;  when  no  further  improvement  is  available  either  by  increasing  or 
by  decreasing  the  search  range,  the  value  function  obtained  will  be  within  distance  £  of  the 
optimal  cost,  i.e.,  the  resulting  elite  policy  is  approximately  ^-optimal. 

Note  that  the  validity  of  the  e-optimality  claim  relies  on  the  assumption  that  if  there  is 
an  improvement  of  at  least  e  available,  then  the  algorithm  will  be  able  to  find  it  via  adaptive 
adjustment  of  the  search  range.  The  above  approach  retains  the  theoretical  convergence 
properties  of  the  original  ERPS  method  and  can  be  applied,  at  least  in  principle,  to  many 
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types  of  action  spaces  as  long  as  a  metric  can  be  specified;  however,  we  must  again  emphasize 
that  the  efficiency  of  the  approach  will  depend  on  the  structure  of  the  problem  to  be  solved 
and  how  the  underlying  metric  is  actually  defined. 

6  Numerical  Examples 

In  this  section,  we  apply  ERPS  to  two  discrete-time  controlled  queueing  problems  and  com¬ 
pare  its  performance  with  that  of  EPI  (Chang  et  al.  2002)  and  standard  PI.  For  ERPS,  the 
same  search  range  parameter  is  prescribed  for  all  states,  denoted  by  a  single  variable  r,  and 
the  action  selection  distribution  V  is  chosen  to  be  the  uniform  distribution.  All  computations 
were  performed  on  an  IBM  PC  with  a  2.4  GHz  Pentium  4  processor  and  512  MB  memory, 
and  the  computational  time  units  are  in  seconds. 

6.1  A  One-Dimensional  Queueing  Example 

The  following  example  is  adapted  from  de  Farias  and  Van  Roy  2003  (cf.  also  Bertsekas 
1995).  We  consider  a  finite  capacity  single-server  queue  with  controlled  service  completion 
probabilities.  Assume  that  a  server  can  serve  only  one  customer  in  a  period,  and  the  service 
of  a  customer  begins/ends  only  at  the  beginning/end  of  any  period.  Customers  arrive  at  a 
queue  independently  with  probability  p  =  0.2,  and  there  is  at  most  one  arrival  per  period  (so 
no  arrival  with  probability  0.8).  The  maximum  queue  length  is  £,  and  an  arrival  that  finds 
C  customers  in  the  queue  is  lost.  We  let  xt ,  the  state  variable,  be  the  number  of  customers 
in  the  system  at  the  beginning  of  period  t.  The  action  to  be  chosen  at  each  state  is  the 
service  completion  probability  a,  which  takes  value  in  a  set  A.  In  period  t,  a  possible  service 
completion  is  generated  with  probability  a(xt),  a  cost  of  R(xt,  a(xt ))  is  incurred,  and  resulting 
in  a  transition  to  state  xt+ 1-  The  goal  is  to  choose  the  optimal  service  completion  probability 
for  each  state  such  that  the  total  discounted  cost  is  minimized. 

6.1.1  Discrete  Action  Space 

Two  different  choices  of  one-stage  cost  functions  are  considered:  (i)  a  simple  cost  function 
that  is  convex  in  both  state  and  action;  (ii)  a  complicated  non-convex  cost  function.  The 
MDP  problem  resulting  from  case  (i)  may  possess  some  nice  properties  (e.g.,  free  of  multiple 
local  optimal  solutions),  so  finding  an  optimal  solution  should  be  a  relatively  easy  task; 
whereas  the  cost  function  in  case  (ii)  introduces  some  further  computational  difficulties  (e.g., 
multiple  local  minima),  intended  to  more  fully  test  the  effectiveness  of  a  global  algorithm  like 
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ERPS.  For  both  cases,  unless  otherwise  specified,  the  following  parameter  settings  are  used: 
maximum  queue  length  C  =  49;  state  space  X  =  {0,1,2,...,  49};  discount  factor  a  =  0.98; 
action  set  A  =  {10~4/c  :  k  —  0, 1, . . . ,  104};  and  in  ERPS,  population  size  n  =  10,  search 
range  r  =  10,  and  the  standard  Euclidean  distance  is  used  to  define  the  neighborhood.  All 
results  for  ERPS  are  based  on  30  independent  replications. 

For  case  (i),  the  one-stage  cost  at  any  period  for  being  in  state  x  and  taking  action  a  is 
given  by 

R(x,  a)  =  x  +  50a2. 


We  test  the  convergence  of  ERPS  by  varying  the  values  of  the  exploitation  probability. 
Table  1  gives  the  performances  of  the  algorithm,  where  the  relative  error  of  a  value  function 
J  is  evaluated  in  the  infinity-norm  according  to  the  following  formula 

I  j  ~  7*11 

I  oo 


relerr  := 


I  J* 


(12) 


and  J*  is  the  optimal  value  function.  The  computational  time  required  for  PI  to  find 
the  optimal  value  function  J*  was  15  seconds,  and  the  value  of  ||</*||oo  is  approximately 
2.32e+03.  Test  results  indicate  superior  performances  of  ERPS  over  PI;  in  particular,  when 
q0  =  0.25,  0.5,  0.75,  ERPS  attains  the  optimal  solutions  in  all  30  independent  trials  within  2 
seconds. 

To  explore  the  computational  complexity  of  ERPS,  tests  were  performed  on  MDPs  with 
increasing  numbers  of  actions;  for  each  problem,  the  foregoing  setting  is  used  except  that 
the  action  space  now  takes  the  form  A  =  \hk  :  k  —  0, 1, . . . ,  4},  where  h  is  the  mesh  size,  se¬ 
lected  sequentially  (one  for  each  problem)  from  the  set  {  ^  I^,  ^q, 

-A _ I _ I_\ 

50000  ’  100000  ’  200000  / ' 

In  Figure  3,  we  plot  the  running  time  required  for  PI  and  ERPS  to  find  the  optimal 
solutions  as  a  function  of  the  number  of  actions  of  each  MDP  considered,  where  the  results 
for  ERPS  are  the  averaged  time  over  30  independent  replications.  Empirical  results  indicate 
that  the  computational  time  for  PI  increases  linearly  in  the  number  of  actions  (note  the  log- 
scale  used  in  Figure  3),  while  the  running  time  required  for  ERPS  does  so  in  an  asymptotic 
sense.  We  see  that  ERPS  delivers  very  competitive  performances  even  when  the  action  space 
is  small;  when  the  action  space  is  relatively  large  (number  of  actions  greater  than  104),  ERPS 
reduces  the  computational  efforts  of  PI  by  roughly  a  factor  of  14.  In  the  experiments,  we 
used  a  search  range  r  =  10  in  ERPS,  regardless  of  the  size  of  the  action  space;  we  believe  the 
performance  of  the  algorithm  could  be  enhanced  by  using  a  search  range  that  is  proportional 
to  the  size  of  the  action  space.  Moreover,  the  computational  effort  of  ERPS  can  be  reduced 
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% 

stop  rule  (K) 

Avg.  time  (std  err) 

mean  relerr  (std  err) 

2 

0.84  (0.03) 

7.63e-06 

(8.50e-08) 

4 

1.41  (0.05) 

2.78e-06 

(3.29e-07) 

0.0 

8 

2.67  (0.10) 

7.83e-07 

(1.06e-07) 

16 

5.12  (0.16) 

1.81e-07 

(1.88e-08) 

32 

8.91  (0.38) 

6.19e-08 

(1.07e-08) 

2 

0.94  (0.02) 

3.32e-09 

(1.42e-09) 

4 

1.08  (0.02) 

9.65e-10 

(2.59e-10) 

0.25 

8 

1.24  (0.02) 

3.02e-10 

(9.51e-ll) 

16 

1.52  (0.03) 

4.54e-ll 

(3.86e-ll) 

32 

1.85  (0.04) 

0.00e-00 

(0.00e-00) 

2 

0.92  (0.02) 

2.14e-09 

(1.29e-09) 

0.50 

4 

1.00  (0.02) 

2.53e-10 

(1.10e-10) 

8 

1.11  (0.02) 

7.61e-ll 

(5.02e-ll) 

16 

1.27  (0.03) 

0.00e-00 

(0.00e-00) 

2 

1.14  (0.02) 

4.14e-10 

(2.84e-10) 

0.75 

4 

1.19  (0.02) 

2.40e-ll 

(1.67e-ll) 

8 

1.27  (0.02) 

1.18e-ll 

(1.18e-ll) 

16 

1.44  (0.03) 

0.00e-00 

(0.00e-00) 

2 

12.14  (0.02) 

1.66e-10 

(5.18e-ll) 

1.0 

4 

12.19  (0.02) 

4.85e-ll 

(3.49e-ll) 

8 

12.28  (0.01) 

0.00e-00 

(0.00e-00) 

Table  1:  Convergence  results  for  ERPS  (n  =  10,  r  =  10)  based  on  30  independent  replica¬ 
tions.  The  standard  errors  are  in  parentheses. 


considerably  if  we  are  merely  seeking  solutions  within  some  required  accuracy  rather  than 
insisting  on  the  optimal  solution. 

For  case  (ii),  we  used  the  following  one-stage  cost  function 

R(x,  a)  =  x  +  5  -^-sin(2ira)  —  x 


which  induces  a  tradeoff  in  choosing  between  large  values  of  a  to  reduce  the  state  x  and 
appropriate  values  of  a  to  make  the  squared  term  small.  Moreover,  since  the  sine  function 
is  not  monotone,  the  resultant  MDP  problem  has  a  very  high  number  of  local  minima;  some 
typical  locally  optimal  policies  are  shown  in  Figure  4. 
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number  of  actions  number  of  actions 


(b) 


Figure  3:  Running  time  required  for  PI  &  ERPS  (n  =  10,  r  =  10,  based  on  30  independent 
replications)  to  find  the  optimal  solutions  to  MDPs  with  different  numbers  of  actions,  (a) 
using  log-scale  for  horizontal  axis;  (b)  using  log-log  plot. 


Table  2  shows  the  convergence  properties  of  EPI  and  ERPS,  where  both  algorithms  start 
with  the  same  initial  population.  The  computational  time  required  for  PI  to  find  the  optimal 
value  function  J*  was  14  seconds,  and  the  magnitude  of  ||J*||oo  is  approximately  1.03e+05. 
For  EPI,  we  have  tested  different  sets  of  parameters  (recall  from  Section  3.3.1,  Remark  2, 
that  Pm  is  the  mutation  probability;  and  Pg  (Pi)  are  the  predefined  global  (local)  mutation 
probabilities);  the  results  reported  in  Table  2  are  the  best  results  obtained.  Also  note  that 
because  of  the  slow  convergence  of  EPI,  the  values  for  the  stopping  control  parameter  K  are 
chosen  much  larger  than  those  for  ERPS. 

In  order  to  demonstrate  the  role  of  the  exploitation  probability  g0  in  the  ERPS  algorithm, 
we  fix  the  stopping  control  parameter  K  =  10  and  vary  go-  The  numerical  results  are  recorded 
in  Table  3,  where  Nopt  indicates  the  number  of  times  an  optimal  solution  was  found  out  of 
30  trials.  The  g0  =  1.0  case  corresponds  to  pure  local  search.  Obviously  in  this  case,  the 
algorithm  gets  trapped  into  a  local  minimum,  which  has  a  mean  relative  error  of  5.62e- 
3.  However,  note  that  the  standard  error  is  zero,  which  means  that  the  local  minimum 
is  estimated  with  very  high  precision.  This  shows  that  the  “nearest  neighbor”  heuristic  is 
indeed  useful  in  fine-tuning  the  solutions.  In  contrast,  the  pure  random  search  (g0  =  0)  case 
is  helpful  in  escaping  from  the  local  minima,  yielding  a  lower  mean  relative  error  of  2.59e-5, 
but  it  is  not  very  good  in  locating  the  exact  optimal  solutions,  as  none  was  found  out  of  30 
trials.  Roughly,  increasing  g0  between  0  and  0.5  leads  to  a  more  accurate  estimation  of  the 
optimal  solution;  however,  increasing  g0  on  the  range  0.6  to  1.0  decreases  the  quality  of  the 
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action  action 


Figure  4:  Four  typical  locally  optimal  solutions  to  the  test  problem. 


algorithms 

stop  rule  (K) 

Avg.  time  (std  err) 

mean  relerr  (std  err) 

EPI 

20 

2.13  (0.11) 

1.74e-02  (1.35e-03) 

Pm  =  0.1 

40 

3.80  (0.16) 

1.12e-02  (8.81e-04) 

Pg  =  0.9 

80 

6.63  (0.34) 

7.13e-03  (5.37e-04) 

Pi  =  0.1 

160 

16.30  (0.59) 

3.22e-03  (2.26e-04) 

2 

1.03  (0.02) 

9.81e-05  (5.17e-05) 

ERPS 

4 

1.12  (0.03) 

7.12e-05  (4.95e-05) 

go  =  0.5 

8 

1.28  (0.03) 

2.37e-05  (1.64e-05) 

r  =  10 

16 

1.50  (0.03) 

1.06e-09  (6.59e-10) 

32 

1.86  (0.04) 

0.00e-00  (0.00e-00) 

Table  2:  Convergence  results  for  EPI  (n  =  10)  &  ERPS  (n  =  10,  r  =  10)  based  on  30 
independent  replications.  The  standard  errors  are  in  parentheses. 
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solution,  because  the  local  search  part  begins  to  gradually  dominate,  so  that  the  algorithm  is 
more  easily  trapped  in  local  minima.  This  also  explains  why  we  have  larger  variances  when 
qo  =  0.6,  0.7,  0.8,  0.9  in  Table  3.  Notice  that  the  algorithm  is  very  slow  in  the  pure  local 
search  case;  setting  q0  <  1  speeds  up  the  algorithm  substantially. 


% 

Avg.  time  (std  err) 

mean  relerr  (std  err) 

0.0 

3.30  (0.13) 

0 

2.59e-05  (6.19e-06) 

0.1 

1.96  (0.04) 

5 

4.51e-08  (8.60e-09) 

0.2 

1.48  (0.03) 

12 

1.26e-08  (3.47e-09) 

0.3 

1.39  (0.02) 

24 

2.74e-09  (2.02e-09) 

0.4 

1.28  (0.02) 

25 

2.69e-05  (1.89e-05) 

0.5 

1.32  (0.03) 

27 

8.75e-10  (6.01e-10) 

0.6 

1.41  (0.04) 

25 

6.19e-05  (3.20e-05) 

0.7 

1.50  (0.04) 

22 

1.53e-04  (6.96e-05) 

0.8 

1.81  (0.04) 

15 

3.04e-04  (7.09e-05) 

0.9 

2.33  (0.08) 

11 

7.99e-04  (1.63e-04) 

1.0 

7.86  (0.02) 

0 

5.62e-03  (0.00e-00) 

Table  3:  Performance  of  ERPS  with  different  exploitation  probabilities  (n  =  10,  K  =  10,  r  = 
10)  based  on  30  independent  replications.  The  standard  errors  are  in  parentheses. 

To  provide  a  numerical  comparison  between  the  “nearest  neighbor”  heuristic  (biased 
sampling)  and  the  policy  mutation  procedure  (unbiased  sampling),  we  call  the  algorithm 
with  the  PICS  step  but  policy  mutation  procedure  as  algorithm  1.  In  both  ERPS  and 
algorithm  1,  we  fix  the  population  size  n  =  10,  and  stop  the  algorithms  only  when  a  desired 
accuracy  is  reached.  In  Table  4,  we  record  the  length  of  time  required  for  different  algorithms 
to  reach  a  relative  error  of  at  least  1.0e-6.  Indeed,  we  see  that  ERPS  uses  far  less  time  to 
reach  a  required  accuracy  than  algorithm  1  does. 

6.1.2  Continuous  Action  Space 

We  test  the  algorithm  when  the  action  space  A  is  continuous,  where  the  service  completion 
probability  can  be  any  value  between  0  and  1.  Again,  two  cost  functions  are  considered, 
corresponding  to  cases  (i)  and  (ii)  in  Section  6.1.1.  In  both  cases,  the  maximum  queue 
length  £,  state  space  X,  and  the  discount  factor  a  are  all  taken  to  be  the  same  as  before. 

In  the  numerical  experiments,  we  approximated  the  optimal  costs  and  for  each  of 
the  respective  cases  (i)  and  (ii)  by  two  value  functions  and  J|,  which  were  computed  by 
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algorithms 

parameters 

Avg.  time  (std  err) 

actual  rclerr  (std  err) 

q0  =  0.0 

13.31  (0.60) 

7.63e-07  (3.71e-08) 

qo  =  0.1 

1.20  (0.03) 

4.99e-07  (5.47e-08) 

ERPS 

q0  =  0.3 

0.96  (0.04) 

3.26e-07  (4.83e-08) 

r  =  10 

qo  =  0.5 

0.97  (0.03) 

3.84e-07  (5.08e-08) 

qo  =  0.7 

1.61  (0.18) 

3.47e-07  (4.91e-08) 

qo  =  0.9 

4.03  (0.62) 

2.33e-07  (4.62e-08) 

Pm  =  0.1,  Pg  =  0.9,  Pi  =  0.1 

62.4  (3.0) 

7.61e-07  (3.67e-08) 

Pm  =  0.3,  Pg  =  0.9,  Pi  =  0.1 

33.3  (1.4) 

8.42e-07  (2.76e-08) 

ALG.  1 

Pm  =  0.5,  Pg  =  0.9,  Pi  =  0.1 

26.6  (1.4) 

8.35e-07  (2.93e-08) 

Pm  =  0.7,  Pg  =  0.9,  Pi  =  0.1 

22.1  (1.2) 

7.88e-07  (3.34e-08) 

Pm  =  0.9,  Pg  =  0.9,  Pi  =  0.1 

20.2  (1.1) 

8.44e-07  (2.55e-08) 

Pm  =  1.0,  Pg  =  1.0,  Pi  =  0.0 

17.6  (0.9) 

7.67e-07  (4.08e-08) 

Table  4:  Average  time  required  to  reach  a  precision  of  at  least  1.0e-6  for  different  algorithms. 
All  results  are  based  on  30  independent  replications.  The  standard  errors  are  in  parentheses. 


using  the  adaptive  ERPS  algorithm  under  the  following  parameter  settings:  population  size 
n  =  10;  stopping  control  parameter  K  =  10;  exploitation  probability  q0  =  0.5;  initial  search 
range  r  =  A;  tolerance  £  =  le-12  for  case  (i)  and  £  =  le-10  for  case  (ii);  K\  =  5;  K2  =  5; 
A'3  =  5;  7  =  2.  We  performed  200  independent  runs  of  the  adaptive  ERPS  algorithm  for 
each  case,  and  J*  (J2)  was  obtained  as  the  best  solution  out  of  the  200  replications. 

We  set  the  population  size  n  =  10,  termination  control  parameter  K  =  10,  and  test 
the  ERPS  algorithm  by  using  different  values  of  the  search  range  r.  The  performance  of 
the  algorithm  is  also  compared  with  that  of  a  deterministic  policy  iteration  (PI)  algorithm, 
where  we  first  uniformly  discretize  the  action  space  into  evenly  spaced  points  by  using  a  mesh 
size  h,  and  then  apply  the  standard  PI  algorithm  on  the  discretized  problem.  Tables  5  and  6 
give  the  performances  of  both  algorithms  for  cases  (i)  and  (ii),  respectively.  Note  that  the 
relative  errors  are  actually  computed  by  replacing  the  optimal  costs  with  their  corresponding 
approximations  in  equation  (12). 

Test  results  indicate  that  ERPS  outperforms  the  discretization-based  PI  algorithm  in 
both  cases,  not  only  in  computational  time  but  also  in  solution  quality.  We  observe  that  the 
computational  time  for  PI  increases  by  a  factor  of  2  for  each  halving  of  the  mesh  size,  while 
the  time  for  ERPS  increases  at  a  much  slower  rate. 
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algorithms 

parameters 

Avg.  time  (std  err) 

mean  relerr  (std  err) 

q0  =  0.25 

2.54  (0.10) 

1.92e-12  (3.64e-13) 

ERPS 

q0  =  0.50 

2.27  (0.09) 

6.41e-13  (7.07e-14) 

(r  =  looo) 

g0  =  0.75 

2.92  (0.08) 

1.92e-13  (2.69e-14) 

g0  =  0.25 

2.61  (0.10) 

4.66e-13  (6.03e-14) 

ERPS 

g0  =  0.50 

2.91  (0.10) 

1.08e-13  (1.59e-14) 

(r  =  8000) 

go  =  0.75 

3.05  (0.11) 

6.84e-14  (1.03e-14) 

g0  =  0.25 

2.84  (0.09) 

1.33e-13  (2.35e-14) 

ERPS 

go  =  0.50 

3.25  (0.10) 

3.06e-14  (4.56e-15) 

(r  =  — 1 — ) 
v  16000  ' 

go  =  0.75 

3.68  (0.10) 

1.89e-14  (2.50e-15) 

h  =  — 

1  4000 

6  (N/A) 

7.96e-09  (N/A) 

h  =  — 

1  8000 

12  (N/A) 

1.72e-09  (N/A) 

PI 

h  =  1 

16000 

23  (N/A) 

4.74e-10  (N/A) 

h  =  1 

32000 

47  (N/A) 

9.52e-ll  (N/A) 

h  =  1 

1  128000 

191  (N/A) 

6.12e-12  (N/A) 

h  =  1 

512000 

781  (N/A) 

3.96e-13  (N/A) 

Table  5:  Comparison  of  the  ERPS  algorithm  (n  =  10,  K  =  10)  with  the  deterministic  PI 
algorithm  for  case  (i).  The  results  of  ERPS  are  based  on  30  independent  replications.  The 
standard  errors  are  in  parentheses. 


6.2  A  Two-Dimensional  Queueing  Example 


The  second  example,  shown  in  Figure  5,  is  a  slight  modification  of  the  first  one,  with  the 
difference  being  that  now  we  have  a  single  queue  that  feeds  two  independent  servers  with 
different  service  completion  probabilities  eq  and  a2.  We  consider  only  the  continuous  action 
space  case.  The  action  to  be  chosen  at  each  state  x  is  (ai,a2)r,  which  takes  value  from  the 
set  A  =  [0, 1]  x  [0, 1].  We  assume  that  an  arrival  that  finds  the  system  empty  will  always  be 
served  by  the  server  with  service  completion  probability  eq.  The  state  space  of  this  problem 
is  X  =  {0,  l51, 1s2,2,  . . .  ,48},  where  we  have  assumed  that  the  maximum  queue  length  (no 
including  those  in  service)  is  46,  and  1^,  ls2  are  used  to  distinguish  the  situations  whether 
server  1  or  server  2  is  busy  when  there  is  only  one  customer  in  the  system.  As  before,  the 
discount  factor  a  =  0.98. 

The  one-stage  cost  is  taken  to  be 


R{y,al,a2 )  =  y  + 


1*1 


cos(7rai)  —  y 


hsi}  + 


1*1  .  ,  V 

—  sm(7ra2)  -  y 
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algorithms 

parameters 

Avg.  time  (std  err) 

mean  relerr  (std  err) 

q0  =  0.25 

2.75  (0.10) 

8.49e-ll  (1.50e-ll) 

ERPS 

q0  =  0.50 

2.91  (0.09) 

1.76e-ll  (2.90e-12) 

(r  =  4000  ) 

g0  =  0.75 

3.16  (0.09) 

8.53e-12  (1.21e-12) 

g0  =  0.25 

3.09  (0.12) 

1.70e-ll  (2.57e-12) 

ERPS 

g0  =  0.50 

3.00  (0.12) 

4.17e-12  (4.94e-13) 

(r  =  8000  ) 

go  =  0.75 

3.62  (0.08) 

1.55e-12  (1.47e-13) 

g0  =  0.25 

3.20  (0.10) 

6.08e-12  (1.17e-12) 

ERPS 

go  =  0.50 

3.28  (0.11) 

1.19e-12  (1.40e-13) 

(r  =  — 1 — ) 

v  16000  ' 

go  =  0.75 

4.20  (0.12) 

4.25e-13  (5.05e-14) 

h  =  — 

1  4000 

6  (N/A) 

2.71e-07  (N/A) 

h  =  — 

1  8000 

11  (N/A) 

5.66e-08  (N/A) 

PI 

h  =  1 

16000 

22  (N/A) 

1.58e-08  (N/A) 

h  =  1 

32000 

43  (N/A) 

5.21e-09  (N/A) 

h  =  1 

1  128000 

176  (N/A) 

3.58e-10  (N/A) 

h  =  1 

512000 

727  (N/A) 

1.71e-ll  (N/A) 

Table  6:  Comparison  of  the  ERPS  algorithm  (n  =  10,  K  =  10)  with  the  deterministic  PI 
algorithm  for  case  (ii).  The  results  of  ERPS  are  based  on  30  independent  replications.  The 
standard  errors  are  in  parentheses. 

where 

f  1  if  server  i  is  busy,  f  1  if  x  G  {lSl,  ls2}> 

l{Si}  —  s  (*  =  1,2),  and  y  =  < 

I  0  otherwise,  I  x  otherwise. 


Figure  5:  A  two-dimensional  queueing  example. 

Again,  in  computing  the  relative  error,  we  approximated  J*  by  ./*,  which  was  computed 
by  using  the  adaptive  ERPS  algorithm  under  the  same  settings  (e.g.,  parameter  settings, 
number  of  replications)  as  in  case  (ii)  of  Section  6.1.2.  The  value  of  ||</*||oo  is  approximately 
1.72e+04. 
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algorithms 

parameters 

Avg.  time  (std  err) 

mean  relerr  (std  err) 

q0  =  0.25 

3.26  (0.14) 

2.60e-06  (1.36e-07) 

ERPS 

q0  =  0.50 

3.20  (0.15) 

1.06e-05  (9.17e-06) 

(r  =  ISo) 

q0  =  0.75 

3.64  (0.14) 

8.98e-05  (2.54e-05) 

q0  =  0.25 

3.37  (0.12) 

6.67e-07  (3.59e-08) 

ERPS 

q0  =  0.50 

3.28  (0.12) 

9.58e-06  (9.20e-06) 

(r  =  25o) 

go  =  0.75 

3.89  (0.17) 

9.38e-05  (2.47e-05) 

g0  =  0.25 

3.78  (0.11) 

1.50e-07  (8.30e-09) 

ERPS 

go  =  0.50 

3.85  (0.12) 

9.30e-06  (9.21e-06) 

(r  =  46o) 

go  =  0.75 

4.45  (0.14) 

4.59e-05  (1.90e-05) 

^  =  Too 

15  (N/A) 

1.65e-04  (N/A) 

PI 

h  =  — 

1  200 

57  (N/A) 

4.30e-05  (N/A) 

h  =  — 

H  400 

226  (N/A) 

8.87e-06  (N/A) 

Table  7:  A  two-dimensional  test  example.  The  results  of  ERPS  are  based  on  30  independent 
replications  (n  =  10,  K  =  10). 

The  performances  of  the  ERPS  and  the  discretization-based  PI  are  reported  in  Table  7. 
In  ERPS,  both  the  population  size  n  and  the  stopping  control  parameter  K  are  set  to  10.  In 
PI,  we  adopt  a  uniform  discretization,  where  the  same  mesh  size  h  is  used  in  both  directions 
of  the  action  space.  Notice  that  the  computational  time  for  PI  increases  by  a  factor  of  4 
for  each  halving  of  the  mesh  size,  whereas  the  time  required  by  ERPS  increases  much  more 
slowly. 

In  Table  8,  we  compare  the  performance  of  the  adaptive  ERPS  algorithm  and  the  original 
ERPS  algorithm  in  obtaining  high  quality  solutions.  In  both  algorithms,  we  choose  the 
population  size  n  =  10,  the  stopping  control  parameter  K  =  10,  and  the  exploitation 
probability  go  =  0.5.  In  adaptive  ERPS,  the  initial  search  range  r  =  0.1,  7  =  2,  parameters 
K\ ,  K'2  and  K:i  are  all  set  to  5,  and  the  improvements  in  elite  policies  are  evaluated  in 
the  infinity-norm.  We  see  that  in  order  to  obtain  more  and  more  accurate  solutions,  the 
search  range  in  ERPS  has  to  be  chosen  excessively  small,  which  causes  significant  increase  in 
computational  effort.  In  contrast,  the  adaptive  ERPS  achieves  better  solutions  within  less 
time;  moreover,  the  algorithm  provides  us  with  a  rough  estimation  of  the  solution  quality: 
as  mentioned  in  Section  5,  the  average  difference  between  the  resultant  value  function  J  and 
the  optimal  cost  J*  (i.e.,  ||J  —  J*||oo)  will  be  of  the  same  order  of  magnitude  as  e%  and  the 


27 


relative  error  can  also  be  estimated  as: 


relerr 


J  ~  .7*11  op  _  1|  J  -  J*  | 
Halloo  ~  ||Jioo 


£ 


algorithms 

parameters 

Avg.  time  (std  err) 

mean  relerr  (std  err) 

||  J  —  J*  || 00  (std  err) 

T  =  — I — 
20000 

16.4  (0.2) 

2.25e-ll  (8.88e-13) 

N/A  (N/A) 

ERPS 

1  ~  40000 

24.8  (0.3) 

5.04e-12  (1.95e-13) 

N/A  (N/A) 

r  —  80000 

39.1  (0.5) 

1.02e-12  (7.18e-14) 

N/A  (N/A) 

Adaptive 

£  =le-07 

13.8  (0.7) 

9.28e-12  (3.22e-12) 

1.59e-07  (5.54e-08) 

ERPS 

£  =le-08 

15.7  (0.8) 

3.95e-13  (1.67e-13) 

6.80e-09  (2.87e-09) 

£  =le-09 

17.1  (0.7) 

1.09e-13  (3.12e-14) 

1.87e-09  (5.37e-10) 

Table  8:  Comparison  of  ERPS  (n  =  10,  K  =  10,  go  =  0.5)  with  adaptive  ERPS  (n  = 
10,  K  =  10,  g0  =  0.5,  r  =  0.1,  Kx  =  K2  =  K3  =  5,  7  =  2),  based  on  30  independent 
replications. 


7  Conclusions  and  Future  Work 

In  this  paper,  we  presented  an  evolutionary,  population-based  method  called  ERPS  for  solv¬ 
ing  infinite  horizon  discounted  cost  MDP  problems.  We  showed  that  the  algorithm  converges 
to  an  optimal  policy  w.p.l.  We  also  illustrated  the  algorithm  by  applying  it  to  two  controlled 
queueing  examples  with  large  or  uncountable  action  spaces.  Numerical  experiments  on  these 
small  examples  indicate  that  the  ERPS  algorithm  is  a  promising  approach,  outperforming 
some  existing  methods  (including  the  standard  policy  iteration  algorithm). 

Many  challenges  remain  to  be  addressed  before  the  algorithm  can  be  applied  to  realistic¬ 
sized  problems.  The  motivation  behind  ERPS  is  the  setting  where  the  action  space  is  ex¬ 
tremely  large  so  that  enumerating  the  entire  action  space  becomes  computationally  imprac¬ 
tical;  however,  the  approach  still  requires  enumerating  the  entire  state  space.  To  make  it 
applicable  to  large  state  space  problems,  the  algorithm  will  probably  need  to  be  used  in 
conjunction  with  some  other  state  space  reduction  techniques  such  as  state  aggregation  or 
value  function  approximation.  This  avenue  of  investigation  clearly  merits  further  research. 

Another  important  issue  is  the  dependence  of  ERPS  on  the  underlying  distance  metric, 
as  determining  a  good  metric  could  be  challenging  for  those  problems  that  do  not  have 
a  natural  metric  already  available.  One  possible  way  to  get  around  this  is  to  adaptively 


updating/changing  the  action  selection  distribution  V  at  each  iteration  of  the  algorithm 
based  on  the  sampling  information  obtained  during  the  previous  iterations.  This  actually 
constitutes  a  learning  process;  the  hope  is  that  more  promising  actions  will  have  larger 
chances  of  being  selected  so  that  the  future  search  will  be  biased  toward  the  region  containing 
high  quality  solutions  (policies). 

Another  practical  issue  is  the  choice  of  the  exploitation  probability  g0.  As  noted  earlier, 
the  parameter  go  serves  as  a  tradeoff  between  exploitation  and  exploration  in  action  selec¬ 
tions.  Preliminary  experimental  results  indicate  some  robustness  with  respect  to  the  value 
of  this  parameter,  in  that  values  between  0.25  and  0.75  all  seem  to  work  well;  however,  this 
may  not  hold  for  larger  problems  or  other  settings,  so  further  investigation  is  required.  One 
approach  is  to  design  a  similar  strategy  as  in  simulated  annealing  algorithms  and  study  the 
behavior  of  the  algorithm  when  the  value  of  go  is  gradually  increasing  from  0  to  1,  which 
corresponds  to  the  transitioning  of  the  search  mechanism  from  pure  random  sampling  to 
pure  local  search. 
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